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Abstract 

In the framework of a lattice-model study of protein folding, we investigate the inter- 
play between designability, thermodynamic stability, and kinetics. To be "protein-like", 
heteropolymers must be thermodynamically stable, stable against mutating the amino- 
acid sequence, and must be fast folders. We find two criteria which, together, guarantee 
that a sequence will be "protein like" : i) the ground state is a highly designable stucture, 
i. e. the native structure is the ground state of a large number of sequences, and ii) 
the sequence has a large A/F ratio, A being the average energy separation between the 
ground state and the excited compact conformations, and T the dispersion in energy of 
excited compact conformations. These two criteria are not incompatible since, on av- 
erage, sequences whose ground states are highly designable structures have large A/F 
values. These two criteria require knowledge only of the compact-state spectrum. These 
claims are substantiated by the study of 45 sequences, with various values of A/F and 
various degrees of designability, by means of a Borst-Kalos-Lebowitz algorithm, and the 
Ferrenberg-Swendsen histogram optimization method. Finally, we report on the reasons 
for slow folding. A comparison between a very slow folding sequence, an average folding 
one and a fast folding one suggests that slow folding originates from a proliferation of 
nearly compact low-energy conformations, not present for fast folders. 
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1 Introduction 



Among the set of all possible linear amino-acid heteropolymers, only very few are "protein like" . For 
a heteropolymer to be "protein like", three requirements must be met: (i) The heteropolymer must 
be thermodynamically stable: at thermal equilibrium, it must spend a large fraction of its time in 
the ground state. (Anfissen ||l| has shown the ground state of a protein to be the biologically active 
configuration.) (ii) The heteropolymer must have a fast folding time: the native state should be 
kinetically accessible, typically within milliseconds to seconds for real proteins, (iii) The "protein- 
like" heteropolymer must be stable against mutations: if an amino acid is mutated into another one, 
the native structure should typically be preserved. 

Why are some sequences of amino acids "protein like" while others are not? Since theoretical 
methods cannot yet reliably find the ground state of real amino-acid chains, we address this ques- 
tion within a simple lattice model. Lattice models have been widely used in the study of protein 
folding dynamics [Q, ^, ^, ^, 0, ^ . The main ingredients of these lattice models are (i) the protein 
is viewed as a heteropolymer on a cubic lattice, and (ii) non-covalently-bonded nearest neighbor 
monomers experience an interaction that depends on monomer type. In one class of lattice mod- 
els, the interactions between adjacent monomers are chosen as random variables from a continuous 
probability distribution (see for instance @, §, §). We adopt another approach, namely a so-called 
H-P model, where the monomers come in only two types, H(ydrophobic) or P(olar) ^. The main 
physical motivation for studying H-P models is that the specific ground-state configuration of real 
proteins appears to be largely determined by optimal burial of hydrophobic amino acids away from 
water [^]. It was also shown from an analysis of the Miyazawa and Jernigan matrix that 
the uncharged amino acids fall into two sets: hydrophobic and polar, according to their affinity for 
water. Moreover, there is experimental evidence that the native structures of certain proteins are 
stable when hydrophobic amino acids are substituted within the hydrophobic class and polar amino 
acids are substituted within the polar class|12]. The small number of possible interactions in an H-P 
model (H-H, H-P, and P-P) and the finite number of possible sequences of a given length provide 
realistic constraints on the design of particular structures. Li et al. ||T^ took advantage of these 
constraints to study some design properties of an H-P model in the complete space of possible se- 
quences. In particular these authors introduced the concept of designability. In the terminology of Li 
et al, the designability of a given compact structure is defined as the number Ns of sequences that 
have this structure as their nondegenerate compact ground state; a highly designable structure is the 
nondegenerate ground state of an atypically large number of sequences. These authors reached the 
conclusion that (i) highly designable structures are likely to be thermodynamically stable since they 
have a large gap in their compact state spectrum, (ii) highly designable structures are likely to be 
stable against point mutations, and (iii) highly designable structures have protein-like motifs. These 
observations suggest that Nature's selection of protein structures is not accidental but a consequence 
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of thermodynamic stability and stability against mutations. 

The aim of the present article is to push further the analysis of Li et al. In particular, the study 



in 1 13 1 is limited to the compact-state spectra of H-P heteropolymers of 27 monomers length (the 
compact states of which fill a 3x3x3 cube), and the dynamical aspects are not discussed. Here we 
address thermodynamic and dynamic properties including all states, not only the compact ones. Of 
course, our present study is therefore limited to a small number of sequences (45), in contrast to the 



complete enumeration in |13]. More precisely, we address the following questions: 



• (i) In the compact-state-spectrum studies, thermodynamic stability is measured by the gap 
in the compact-state spectrum. By gap, we mean the energy difference between the first 
excited compact state(s) and the ground compact state. How does the compact-state-spectrum 
gap correlate with the "true" thermodynamic stability where all the possible conformations 
(including the open and partially open ones) are taken into account? Also, how does the 
"true" thermodynamic stability correlate with the degree of designability? 

• (ii) How does the folding time correlate with the compact-state spectrum? An answer to 
this question was postulated in a small gap in the compact-state spectrum leads to low 
foldability and a large gap in the compact-state spectrum leads to fast folders. It was shown in 
1^ that this postulate is wrong in the framework of a random interaction model: even though 
sequences with a large compact state gap are fast folders, sequences with a small compact state 
gap can fold either slow or fast. Is the postulate right or wrong for H-P heteropolymers? 

• (iii) Are highly designable structures also fast folders ? Sequences that have highly designable 
ground states and thus are stable against mutations are also typically thermodynamically stable 
according to the analysis of the compact-state spectrum carried out in Ref. |l^]. Moreover, 
Vendruscolo et al. [0] showed, in different lattice models, that both types of stability are 
equivalent. One would like to know whether the requirement of fast folding dynamics introduces 
new constraints on the set of "protein-like" sequences. Finally, one can ask why are some 
sequences fast folders while others are slow? 

This article is organized as follows: section ^ is devoted to the details of the model and the technical 
aspects of the simulations. We implement several techniques that were developed in the context of 
statistical mechanics of spin models. These techniques are next used in sections ^ and § to analyze 
the thermodynamics and dynamics of a set of sequences. Section ^ summarizes our answers to the 
questions presented above. 

2 Model and simulation techniques 

The aim of this section is to briefly recall the deflnition of the heteropolymer H-P model, and to 
describe the simulation techniques. 
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Figure 1: A conformation of a 15-mer in two dimensions. 



2.1 The model 

In the model analyzed in |jl^, a protein is a self-avoiding walk of monomers on a cubic lattice. A se- 
quence is defined by the set {cjj} of amino acids along the chain, where crj G {i?(ydrophobic), P(olar)}, 
and i runs over the monomers along the chain (i = 1, ...,27 for a chain which folds into a compact 
3x3x3 cube). The different structures are defined by assigning a position rj to the i-th monomer 
on the cubic lattice, such that the distance between two consecutive monomers is equal to unity 
and two monomers cannot lie at the same site (self-avoidance condition). Given a sequence a, the 
Hamiltonian is 

H = J2E^^,^^Air„r,), (1) 

i<j 

where Eh,h, Eh,p, and Ep^p are the energies of H-H, H-P, and P-P contacts, respectively, and 
A(rj,rj) = 1 if Fj and Vj are nearest-neighboring sites with i and j not adjacent along the chain, 
and zero otherwise. For instance, in the case of the two dimensional conformation of Fig. ||, there 
are 4 H-H contacts, 1 H-P contact, and 3 P-P contacts and the energy of this conformation is thus 
AEh.h + Eh.p + 3Epp. The present model differs from the H-P models in |2|, ^ since we take the 
interaction energies Eh.h, Eh.p, and Ep^p to be all different from each other. ^From the analysis of 
the Miyazawa-Jernigan matrix for real proteins, it was shown in |1C] that Eh,h < Ei{,p < Ep^p and 
{Eh,h + Ep^p)/2 < Eh,p- This last condition expresses the fact that different types of monomers 
tend to segregate. Following |13], we choose Eh,h = —2.3 — Ec-, Eh,p = — 1 — Ec and Ep^p = —Ec- 
These dimensionless coefficients define the energy scale in which the temperature will be measured. 
An increase of the compactness energy Ec tends to favor compact conformations with respect to open 



ones. The results of [13] concerning the compact state spectrum are of course independent of Ec- 
In the present work, the open, partially open, and compact states are all taken into account in the 
numerical calculations, so that the results presented here depend on Ec- The determination of Ec 
from an analysis of the Miyazawa and Jerningan matrix along the lines of [|l^] is rather imprecise, 
but such an analysis suggests that Ec is of the order of 2 or 3. ^From the point of view of our 
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calculations, we have chosen the overall drive to compactification Ec to be large enough so that (i) 
the average compactification time (being the average time to first reach a compact state) is much 
smaller than the average folding time (being the average time to first reach the ground state) and (ii) 
very few sequences have a non-compact ground state. However, Ec should not be too large since, as 
was shown in [0], as Ec increases above a certain threshold, the folded phase gives way to a collapsed 
glassy phase. In practice, after preliminary simulations, we have chosen Ec = 1.5. 

2.2 Simulation techniques 
2.2.1 Dynamics 

The most commonly used algorithm in three-dimensional lattice-heteropolymer folding dynamics is 
the Monte-Carlo algorithm with two one-monomer moves (end moves, corner moves) and one two- 
monomer move (crankshaft move) [see Q for a pedagogical description of this algorithm]. However, 
for the model we simulated, this algorithm spends most of the time refusing moves, leading to large 
computation times. Hence, we chose to implement a Bortz-Kalos-Lebowitz (BKL) type algorithm 
[15 1, an algorithm especially efficient in the presence of slow relaxation. In contrast to the standard 



Monte-Carlo algorithm, this algorithm never rejects moves. In practice, one keeps track of all the 
possible states a that can be reached after the three types of moves listed above, starting from a 
state oq. Once the list of all possible moves is established, together with the Monte-Carlo transition 
probabilities, 

P„„^„ = Min f exp f-^E^) , i) , (2) 



T 

one move is picked according to its relative transition probability and that move is then performed. 
The time spent in the state ao is randomly chosen from the exponential distribution 



PiTao) = exp - V Pao->a V, • (3) 




One need not recalculate this list of possible final states at each step, but instead one updates this 
list by canceling moves that are no longer possible and adding new moves that become possible. 
The energy costs AEa^^a of all these moves are also updated. This algorithm is especially efficient 
for slow and average folders, but still does not allow systematic studies at low temperatures. One 
unit time of this BKL algorithm (noted BKL unit in the following) corresponds to 27 Monte-Carlo- 
Steps (MCSs) of the usual Monte-Carlo algorithm (in lattice heteropolymer folding and for the usual 
Monte-Carlo algorithm, one MCS usually corresponds to attempting to move one monomer). 

2.2.2 Thermodynamics 

One possibility to compute thermodynamic quantities is to carry out an exhaustive enumeration of 
all the possible conformations, including the noncompact ones. Such an exact method was used for 
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instance in in the case of a 15-mer in three dimensions, and in |14| for a 16-mer in two dimensions. 
Since, in practice, we could not apply this exact method to 27-mers in three dimensions, we have 
used Monte-Carlo histogram techniques to study the thermodynamics. 

Several groups working on proteins have rediscovered the Monte-Carlo histogram technique and 
applied it to heteropolymer models of proteins. We refer the reader to for a clear description of this 
technique. Most simply, the technique consists of carrying out a simulation at a given temperature 
To and keeping track of the histogram of various quantities, for instance the joint probability of 
the energy E, number of contacts, and similarity with the ground state. Using this histogram 
calculated at Tq, one can recalculate several thermodynamic quantities at another temperature T by 
changing the Boltzmann weight to exp (—E/T) without carrying out a new simulation. However, T 
should remain close to Tq since the simulation at temperature Tq is carried out over a finite time 
and the phase space is thus only partially sampled. We did not in fact use this "naive" histogram 



technique but rather made use of the powerful method invented by Ferrenberg and Swendsen |16, 17 1. 
This method consists in optimizing several histograms calculated at different temperatures to obtain 
the temperature dependence of various thermodynamic quantities. Moreover, the accuracy of the 
optimized thermodynamic quantities can be evaluated, and additional simulations can be carried 



out when necessary. Following |16|, we briefly summarize this technique by focusing on the case 
of the single energy histogram, the generalization to joint histograms of various quantities being 
straightforward. We consider R energy histograms labeled by n = 1, ..R, carried out over tn time 
units at the temperatures Ti, ...,r„, ...,Tji, with the normalization 

tn = J2^n{E), 
E 

where Nn{E) is the number of times a state (or states) with energy E is sampled in the n-th histogram. 
The partition function Z{T) is expanded over the density of states W{E) as 

Z{T) = Y,W{E)exp{-E/T). 

E 

Each of the R histogram simulations carried out at a temperature T„ leads to a "naive" estimate of 
the density of states 

Wn{E) = t-^Nn{E) exp {E/Tn - Fn/Tn)^ (4) 



with Fn the free energy at temperature r„. The Ferrenberg and Swendsen method ||16| consists in 
expanding the density of states W{E) as a combination of the "naive" densities of states @ 

R 

W{E) = Y,Pn{E)Wn{E), 
n=l 

the coefficients of this expansion (as well as the free energies in (|^) being determined by minimiz- 
ing the error in this estimation of W{E). The result is a closed set of multiple-histogram equations 
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for the Fn IH 



En=l tn exp i-E/Tr, + Fn/T^) 

exp(F„/r„) = ^P(^,r„). (6) 



Once @ and (^) have been solved by successive iterations, the average of an energy-dependent 
operator is calculated as 

3 Thermodynamics 



The compact state spectrum was previously determined by means of exact enumerations [13|. Thanks 
to this work, we know for each sequence the lowest energy compact state. However, it is possible that 
the true ground state is not compact. We therefore checked during the histogram calculations that 
no open state has a lower energy than the compact ground state. We eliminated a few sequences 
with noncompact ground states. Since we were not able to perform an exact enumeration of the 
noncompact states, we cannot exclude the possibility of a noncompact ground state that was not 
found during the simulations. However, we believe that this possibility is rather unlikely. We will 
comment later on the characteristics of those few sequences with a noncompact ground state. 

3.1 Qualitative study of a thermodynamically stable and unstable sequence 

We first begin with a qualitative analysis of a "protein-like" and a "non-protein-like" sequence, 
from the point of view of thermodynamics. As stated in the introduction, one necessary condition 
for a sequence to be "protein-like" is that it be thermodynamically stable. Hence, we have to 



find a quantitative way to measure thermal stability. It has been proposed |18| that real proteins 
undergo a folding transition at some temperature Tf larger than the glass transition Tg, below which 
the dynamics dramatically freezes. For "protein-like" sequences, folding should be a first-order-like 
transition, with Tf much larger than Tg, thus allowing the possibility of a temperature regime in which 
the ground state is thermodynamically favored and kinetically accessible. By "first-order-like", we 
mean that as a function of temperature the native state occupancy has a narrow transition from a 
low value in the unfolded phase to a high value in the folded phase. We will use the width of this 
transition as a measure of thermodynamic stablility, the thermodynamically stable sequences having 
a narrow transition. The first-order-like behavior of the folding transition was put on a quantitative 
basis for a lattice H-P model by Socci and Onuchic 0. These authors studied the shape of the energy 
histogram and observed a transition from a unimodal shape to a bimodal shape as the temperature 
was lowered below Tf, suggesting the occurrence of a first-order- like transition. 
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On the other hand, in the case of a thermodynamically unstable sequence (and thus one that is 
not "protein-hke" ) , there are low- lying states competing with the ground state, and the transition 
to the folded ground state as the temperature is decreased is thus broad. In order to compare 
different sequences, we measure thermodynamic stability in terms of the width in temperature of the 
transition to the native state compared to the transition temperature. In biological situations, the 
temperature is usually fixed and the notion of thermal stability is then not only sequence-dependent 
but also temperature-dependent. For instance, some sequences could have a narrow transition to the 
folded state but with too low a Tj, so that in practice, the protein is unfolded at the temperature 
of interest. Our definition of thermodynamic stability is thus, in the context of real proteins, only a 
necessary condition. 

Following Socci and Onuchic ||^, we employ the optimized histogram method described in section 



2.2.2 , and determine the probability V25{T) that the sequence is in any conformation having more 
than 25 correct ground-state contacts, or "native" contacts, out of 28 possible contacts for the fully 
folded structure. Throughout this paper, this set of conformations will be referred to as "native 
states". We will consider a heteropolymer as correctly folded if the number of native contacts is 
larger than or equal to 25. This assumption speeds up computations and is sensible from the point of 
view of real proteins since structure fluctuations around the folded conformation are expected, and 



do not prevent the protein from being functional (see for instance [19|). The variation of 7^25(7") as a 
function of temperature T are shown on Fig. ^ for the thermodynamically stable sequence A and the 
thermodynamically unstable sequence C. [These two sequences will be studied in detail in the present 
article] . The larger width of the transition to the folded conformation in the case of sequence C is due 
to the existence of low-lying semi compact conformations, not present for thermodynamically stable 
sequences. The consequences of these low-lying states for the folding dynamics will be investigated 
in what follows. 

3.2 Thermodynamic stability: compact state spectrum versus transition width 

In order to answer the first question in item (i) in the introductory section, we want to compare the 
compact-state-spectrum estimation of the thermodynamic stability, in terms of the folding temper- 
ature Tf and the glass-transition temperature Tg |18, to the thermodynamic stability estimated 



with the method of section 3.1 



3.2.1 Compact-state-spectrum results 



We first review the results of Goldstein et al. [18|, and Bryngelson et al. |20|, who estimated the 



folding temperature Tf and the glass temperature Tg in terms of simple spectral quantities. These 
authors derived their criteria from the complete energy spectrum, dividing phase space into two 
components: on the one hand, the native state with a nearly zero entropy and, on the other hand, 
uncorrelated "liquid-like" states at higher energy with a large entropy. If A denotes the average 
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Figure 2: Probability V2^{T) to be in any state having more that 25 native contacts as a function 
of temperature for the thermodynamically stable sequence A and the thermodynamically unstable 
sequence C. 



energy difference between the native and the liquid-like states, and T is the width in energy of the 
liquid- like states, it is shown in [p^ |20| that the ratio Tf /Tg is maximal if A/F is maximal. In other 
words, the ratio A/F should measure how "protein-like" a sequence is. We take here as a working 
hypothesis the application of the results in [18, |2^ to our lattice model including the compact states 



only. In order to test the hypothesis, we calculate A and F from the compact-state spectra, the 
compact states being conformations exactly filling the 3x3x3 cube, and the compact-state spectrum 
of a given sequence being the set of energies of all compact conformations. More precisely, 

A = ^E(^"-^o) (7) 

<^ a>0 \ a>0 / 

where the Ea (a > 0) denote the energies of the excited compact conformations, Eq is the lowest 
compact state energy, and Nc is the number of excited compact conformations. 

3.2.2 Thermodynamic stability versus A/F 

As far as the Monte Carlo results are concerned, we estimate the thermal stability of a sequence by 
the width of the transition in V25{T) versus T, as shown in Fig. 0. More precisely, we calculate the 
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Figure 3: Transition width 5 from the Monte-Carlo simulations versus A/T obtained from the 
compact-state spectrum. As explained in the text, the different sets and sub-sets of sequences are 
denoted by different symbols. Typical error bars in the estimation of 5 are shown. The larger error 
bars for large 5 originate from the increase of glassiness as S increases. 



dimensionless transition width as 



^/ T(o.i)-r(o. 
VT(o.i) + r(o. 



(9) 



where T{x) is the temperature for which the probability is x to find the heteropolymer in a state 
with 25 or more native contacts. [This is the inverse function of T'25(T) as plotted on Fig. |2j. We 
have plotted in Fig. ^ the width 5 of the transition as a function of A /F defined by and . We 
have separated the sequences into a set "L(arge)"of sequences with large values of A/T and a set 
"S(mall)" with small values of this ratio. (For further consideration, we have extracted from the "L" 
set two subsets "LA" and "LB" with a fixed A/T ratio, of order 4.5 and 4.7 respectively. Also, the 
set "SA" has been extracted from the "S" set, with a A/T ratio of order 3.7.) It is clear in Fig. |^ 
that in spite of some scatter, the "L", "LA" and "LB" sets, with a high A/T ratio (larger that 4.3) 
correspond to thermodynamically stable sequences, with a narrow transition to the native structure 
(i.e., a small 5 value), whereas the transition to the native structure for the sequences belonging 
to the "S" and "SA" sets {A/T < 4.3) is rather broad. This shows that "protein-like" sequences 
(at least as far as thermodynamics is concerned) correspond to sequences with a large A/T ratio, i. 
e. sequences with a low-energy native state compared to the energy width of the distribution of 
compact states. 

One can also ask how the transition width 6 correlates with A alone, independent of F. It turns 
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Figure 4: A versus A/T for our 45 sequences selection. The symbols are the same as in Fig. 



out that A/r is almost a monotonically increasing function of A, as shown in Fig. ^ F being nearly 
constant at F ~ 2, and therefore A alone is also a good predictor of thermodynamic stability. 

3.3 Thermodynamic stability versus designability 

The designability of a given structure is measured in terms of the number Ns of sequences that 
have this structure as their nondegenerate ground state, and, as proposed in [|^], the sequences 
corresponding to highly designable structures should be "protein-like". Interestingly, the highly 



designable structures show regular helix- like and /3-sheet-like patterns |13], quite reminiscent of the 
regular patterns in real proteins. Moreover, since these highly designable structures have on average 



a large gap to their first compact excited state |13|, they are likely to be thermodynamically stable. 
More specifically, the average gap in the compact state spectrum suddenly jumps from a low value 
to a high value as Ns increases above Ns ~ 1400. It is thus of interest to relate the thermodynamic 
stability of sequences, measured in terms of 6 in equation (^), to the degree of designability of their 
native state structure measured by Ns- We will reach the conclusion that A/F is a better predictor 
of thermodynamic stability than designability. Qualitatively, this could be anticipated since high 
designability of a structure still allows for very different behaviors of the sequences which design that 
structure. 

The correlation between the transition width 6 of the various sequences analyzed here, and the 
number Ns of sequences that design their native structures is plotted in Fig. ^. Clearly, the se- 
quences whose ground states are highly designable structures are likely to be thermodynamically 
stable. However, we observe that several sequences belonging to the sets "S" and "SA" of thermody- 
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namically unstable sequences have a relatively high designability A^^. These sequences, despite high 
designability, have a small value of A/F. Such sequences are expected to emerge due to the fact that, 
even for a structure with large Ns, the ratio A/F is not necessarily large for all sequences having 



that structure as a ground state [^. In other words, it is possible to find atypical sequences with a 
low value of A/F even for structures with a large Ns- For instance, using the compact-state spectra, 
we have plotted in Fig. ^ the distribution of A/F for the top structure (this is the structure with 
the maximal Ns = 3794) and a structure with Ns = 300. On average the sequences corresponding 
to the structure with Ns = 3794 have larger A/F than the sequences corresponding to the structure 
with Ns = 300. However, the tail of the distribution for the Ns = 3794 structure still extends to 
very small A/F. In summary, large A/F ratio is a good predictor that a sequence will be thermody- 
namically stable. A high designability Ns for a structure is consistent with, but does not guarantee, 
large A/F and hence thermodynamic stability of its associated sequences. 

Finally, we would like to comment on the results of Vendruscolo et al. [14|. These authors 
also carried out an analysis of the relation between thermodynamic stability, calculated with all 
the configurations including the open ones, and the number of sequences Ns that design a given 
structure. The authors analyzed the two dimensional case of the present model with 16 monomer 
chains and exactly enumerated not only all the possible compact but also all the possible open 



state conformations. They found first, in agreement with Li et al. |13], a broad distribution of Ns 
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However, they showed that the thermodynamic stability is flat as a function of Ns, in contradiction 
with our present results, and also in contradiction with unpublished data where the gap in 
the compact state spectrum in two dimensions was shown to increase as a function of Ns- In our 
opinion, this discrepancy is due to the fact that in |14] the overall compactness energy Ec in the 
inter-residue interactions was set to zero, thereby weighting open state configurations too strongly 
with respect to compact states. We expect that a larger Ec would have led to an A''5-dependence of 



the thermodynamic stability in Ref. |14|. 



3.4 Conclusions on the thermodynamics 

This section was devoted to the analysis of the thermodynamics of several sequences, selected from 
highly and poorly designable sequences, and with large and small values of A/F. The known compact 
state spectra allow for the calculation of the ratio A/F, which is thought to be proportional to the 



ratio of the folding temperature to the glass temperature |jT8|, 20|, and thus predicts to what extent a 
given sequence is thermodynamically "protein-like" . The Monte-Carlo data calculated with open and 
partially open states also allowed us to characterize the thermodynamic stability of a given sequence, 
in terms of the width 6 of its folding transition. We have shown that these two quantities, 5 and 
A/F, are well correlated, signaling the validity of the compact-state-spectrum analysis. We have also 
shown that stability with respect to mutations (measured in terms of designability) is not directly 
equivalent to thermal stability. 
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Finally, it is worth noting that we eliminated three sequences with partially open ground states. 
These sequences had a very low A/F ratio (2.17, 2.18, 2.52) [this ratio has been calculated with the 
ground state energy £"0 being the lowest compact state energy]. By contrast, none of the simulated 
sequences with a high or moderate A/F ratio were found to have a partially open ground state. This 
suggests that the ground state conformation of some of the remaining sequences with a small A/F 
ratio may not be compact. However, since these sequences are among the most unstable ones, our 
calculation of the transition widths as far as "protein-like" sequences are concerned is unaffected. 



4 Folding simulations 

Besides thermodynamic stability and stability with respect to mutations of the amino-acid code, the 
native state of a "protein-like" sequence should be dynamically accessible. Clearly, the dynamics of 
all the sequences, including the thermodynamically stable ones, becomes glassy at low temperatures 
in our model since all the inter-residue interactions are attractive. As the temperature is lowered, 
the average folding time (the average time required to first reach the native states, starting from 
a stretched conformation) will first decrease and then, below a certain temperature, will start to 
increase drastically, presumably scaling like In ((tj)) ~ 1/T. Conversely, as we have already seen, the 
native state occupancy increases as the temperature is decreased. Since we want to investigate the 
balance between thermodynamic stability and dynamical accessibility of the native state, we chose to 
examine the dynamics at the temperature Tf such that the native state occupancy is fixed at some 
predetermined Vf. The question is then: do the sequences with a large A/F ratio, thermodymic 
stability, and mutational stability also fold fast at Tf ? 

In order to examine this question, we measured the average folding time, namely the average time 
(tj) necessary for the heteropolymer to first reach any of the native states (with 25 ground state 
contacts, out of 28), starting from a stretched conformation. Because in our model all interaction 
energies are attractive, and individual non-native contacts may be as strong as native contacts, the 
folding dynamics is very slow compared to other models. For this reason, we simulated folding of 
the heteropolymers at the temperature Tf such that the average occupancy of the native states is 
only Vf = 10%. Compared to other works where models with a faster folding time were investigated, 
this occupancy of the native states is quite small. For instance, Abkevich et al. |23] could fold a few 



sequences at a temperature for which the average similarity with ground state was up to 95%. Their 
slowest average folding time was then of the order of 2 x 10® MCS, which is one order of magnitude 
smaller than the fastest folding time in our simulations (given that 1 BKL unit = 27 MCS). However, 
models with faster folding times generally require unrealistic assumptions such as artificially lower 
energy for individual native contacts compared to the non-native ones. Even though we could not 
go systematically to lower temperatures, the effect of lowering the temperature was examined for a 
few sequences. 
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Figure 7: Average folding times (in BKL units) versus A/F, with the same symbols as Figs. ^, and 
^. A very slow folding sequence (hereafter denoted by "W(orst)") has not been shown. It belongs to 
the set "S", has A/F = 2.7, and an average folding time of 4.3 x 10^ BKL units. 



4.1 Average folding times 

One folding simulation consists in starting from a stretched heteropolymer, and running the dynamics 
until one of the native states is first reached. The time required is the folding time, which should 
then be averaged over several folding simulations. In practice, we have averaged the folding time 
over 20 folding simulations. For each sequence, the temperature is fixed at Tf (at this temperature, 
the native-states occupancy is Vf = 10%). The average folding times are plotted in Fig. ^ as a 
function of A/F. The relative error in the average folding time can be estimated by noticing that 
the mean value of the folding time distribution is of the order of the standard deviation (we indeed 
calculated the full folding time distribution for a few sequences). The relative error in the estimation 
of the average folding time is then of the order of the inverse square root of the number of folding 
simulations, namely, in our case of l/\/20 ~ 0.2. This precision is sufficient for the purpose of our 
discussion. 

We observe in Fig. ^ that sequences with a large A/F ratio fold fast. We thus conclude, in 
response to question (iii) in the introductory section, that folding dynamics does not add any con- 
straint in the selection of "protein- like" sequences: once a structure is stable against mutations and 
thermodynamically stable (namely, a sequence with a large A/F ratio and a large Ns), it will be a 
fast folder. 

A quite striking result visible in Fig. ^, is that sequences with a low A/F ratio (belonging to the 
"S" and "SA" sets of thermodynamically unstable sequences), may also fold fast, at least as fast as 
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Figure 8: Folding time of the four sequences A, B, C and D for various native-states occupancies. 
The value of A/F for the four sequences A, B, C and D is respectively 5.05, 4.91, 4.1 and 3.07. The 
sequences A and B are thermodynamically stable, and C and D are thermodynamically unstable. 
The two arrows indicate that the corresponding folding times are a lower bound. 

some of the sequences belonging to the "L", "LA" or "LB" sets. However, at lower temperatures 
and hence larger native-states occupancy, we find a stronger correlation between low A/F ratio and 
slow folding. 

In order to go to lower temperatures, we selected four sequences (denoted by A, B, C, and D; A 
and C being the same sequences as in Fig. |2|), with the same fast folding time, 0.14 x 10^ MCS, at 
Tf, i.e. with a 10% occupancy of conformations with 25 or more native contacts. We folded these 
sequences again at temperatures corresponding to larger native-states occupancy. The sequences A 
and B belong to the set "L" of thermodynamically stable sequences, and the sequences C and D belong 
to the set "S" of thermodynamically unstable sequences. As the native-states occupancy is increased, 
we observe in Fig. ^ that the folding times of the two thermodynamically unstable sequences, C and 
D, clearly become much larger than the folding times of the two thermodynamically stable sequences, 
A and B. Regarding the results of our study at a fixed native-states occupancy of 10% (see Fig. 0), 
this suggests that, as the native-states occupancy increases, the average folding times of the sequences 
with small A/F ratio will increase much more rapidly than the folding time of the sequences with 
large A/F ratio. 
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The conclusion regarding the variations of the average folding times as a function of the native- 



states occupancy is similar to the one reached by Abkevich et al. |23], who proposed that the gap A in 
the compact-state spectrum correlates with the average folding time calculated for large native-states 
occupancy. 

4.2 Slow versus fast folding 

Despite the complexity of protein energy landscapes, several authors have proposed simple effective 
descriptions of these landscapes. For instance, the role of traps (or folding intermediates) in the 
folding process was underlined by several authors, for instance by Bryngelson and Wolynes and 
by Klimov and Thirumalai Q. A scenario was proposed by Bryngelson et al. [^5| in which there 
exists a "folding pathway" to the native state: in a first step, the protein collapses via many possible 
paths in phase space, and, in a second step, folds to the native structure via a small number of 



possible paths. Abkevich et al. found evidence for a nucleation process in a lattice model study [26|. 
The aim of the present section is to see if there exists any simple scenario in our H-P model. 

Among all the sequences analyzed here at a "P/ = 10% occupancy of native states, one sequence 
has an anomalously long folding time, approximatively 4 x 10^ MCS. This sequence belongs to the 
set of thermodynamically unstable sequences (A/F = 2.7 and 6 = 0.77). Since this sequence has the 
longest folding time among our sequence selection, we will denote this sequence by "W(orst)" . We 
will compare this very slow folding sequence to a fast folding "protein-like" sequence (sequence A in 
Fig. ^) , and also to a sequence which is thermodynamically unstable but folds fast at a native-states 
occupancy of Vf = 10% (sequence C in Fig. ^). 

4.2.1 Number of low-energy conformations 

In order to obtain more information on low-energy traps, we carry out the following simulation. We 
start from a stretched conformation and let the dynamics evolve until a conformation is first reached 
whose energy above the ground state AE = E — Eq \s smaller than a given AE'hit- We refer to 
the result of a single such simulation as a "hit". The contact matrix of this conformation is then 
recorded. (The contact matrix encodes a compact or nearly compact structure in a unique way. Its 
matrix elements Cjj, with i,j = 1, ...,27, labeling the monomers along the chain, are equal to unity 
if the mononers i and j are in contact, and zero otherwise.) We repeat this simulation N times 
{N = 1500 in practice). We have chosen a small AE'hit = 0.5, but similar conclusions were obtained 
with AE'hit = 1. We finally examine the different structures encoded in the set of contact matrices 
C(^), ... , C(^) of the N "hits". 

We analyzed three sequences: sequence A (a "protein- like" sequence), sequence C (fast folding at 
Vf = 10% occupation of native states, but thermodynamically unstable), and the very slow folding 
sequence W. 
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Table 1: The twelve structures obtained for sequence C. The energy of these structures is AE = 0.3 
(except for the ground state that has AE = 0). The first column is the density of hits (probability 
that a given structure is hit) and the second column is the Hamming distance to the ground state. 
The ground state is indicated in the table. 

We first study the number of different structures for the sequences A, C, and W. It turns out 
that the "protein-like" sequence A, no states other than the ground state were found with an energy 
AE < AE'hit = 0.5. As far as the two other sequences C and W are concerned, we found 12 different 
structures for sequence C and 125 for sequence W. The list of these structures obtained is shown in 
Table |^ for sequence C and Table ^ for sequence W. The Hamming distance to the ground state in 
the second column of these tables is 

27 27 

d(C,C(«^)) = E E |C.,-<f)|, (10) 

1=1 j=i+3 

with C the contact matrix of a given hit and C^^^^ the contact matrix of the ground state structure. 
The contact matrices being symmetric, the summation in ( [lO|) is restricted to the matrix elements 
i < j. The matrix elements j = i + 1 have been discarded since they correspond to covalent bonds; 
the matrix elements j = i + 2 have also been discarded since no contact can be made between these 
monomers. 

We conclude that, going from the "protein-like" sequence A to sequence C and to the very slow 
folder W, one has a spectacular proliferation of different low-energy conformations that can be hit 
starting from a stretched conformation. It is also worth remarking that the ground state structure 
is not the most likely to be hit for either sequence C or W. 

In order to determine if any structures with energy AE < AE^^ = 0.5 have been missed, we 
first notice that for sequence C, the most unlikely hit has been reached 18 times out of = 1500 
simulations (the density of hits for this structure is 1.2%, as shown in Table |^). This number is much 
larger than unity, indicating that all structures of comparable or greater probability have been found. 
For sequence W, the most unlikely hits have been reached only once out of the = 1500 simulations, 
strongly suggesting that some of the low-energy structures have not been found. In order to estimate 
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Table 2: The 125 structures obtained for sequence W. The energy of all these structures is AE = 0.3, 
except for the ground state that has AE' = 0, and all states are highly compact, with a number of 

contacts > 26 out of 28 possible. The first column is the density of hits (probability that a given 
structure is hit first) and the second column is the Hamming distance to the native state. The ground 
state is indicated in the table. The symbol * denotes a compact structure. 
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Figure 9: Variations of the number of hits i versus their rank i. 



the number of missed structures for sequence W, we have plotted in Fig. |^ the number of times a 
structure has been reached, ordered in decreasing order. The tails have been fitted to an exponential 
decay f{i) = 28.5 exp (-i/50) if « < 90 and g{i) = 120 exp (-x/28) if i > 90. An estimate of the 
error in the total number of structures can be obtained by extrapolating g{i) and solving for g{i) = 1, 
which leads to i = 134, compared to the 125 different structures obtained. This indicates that of 
order 10 structures with AE < AE'hit = 0.5 were not reached in the = 1500 simulations. 

We have also compared the states hit in the folding simulation to the complete set of compact 
low-energy states with an energy AE < AE^hit = 0.5 (known from the compact-state enumeration 
[|l3|). In the case of sequence C, we found that all 12 low-energy states in Table |l] are indeed compact, 
and correspond exactly to all the compact states with an energy AE < AE'hit = 0-5. For sequence W, 
we found 85 compact structures with an energy AE < AE'hit = 0.5, a number considerably smaller 
than the 125 structures in Table |2|, which include partially open structures as well. Moreover, five of 
the 85 compact-state structures were not hit in our simulations. It is very plausible that these five 
low-energy compact structures were missed because of poor statistics for the most unlikely hits; this 
is consistent with the earlier statistical estimate of a total of 10 missed structures. 

The comparison between the low-energy compact structures and the hit simulations shows that, 
for the average folding sequence C, the full low-energy phase space is dynamically accessible, though 
some of these low-energy conformations are more likely to be hit (see Table The proliferation of 
different low-energy conformations in the case of the slow folding sequence W closely matches the 
large number of low-energy compact conformations. This is natural because the low energy cut-off 
AE'hit = 0.5 restricts "hits" to compact or very-close-to compact states. 
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Figure 10: Evolution of the similarity S{tQ,T) between the conformation at times to and to + r. 



4.2.2 "Trapping" time in the low energy conformations 

We now turn to the question of how long the protein remains in the first states reached with an 
energy AE < AE'hit- In particular, we would hke to compare the magnitudes of the "trapping times" 
in these low-energy states. To do so, we start from a stretched conformation and let the dynamics 
evolve until a state with energy AE < AE'hit is first met at time Iq. We then calculate the similarity 
S{to, t) between the conformation at time to and at a later time to+T. This quantity is then averaged 
over N = 1500 complete simulations. The results are plotted in Fig. |l^ for the sequences C and W, 
for AE'hit = 0.5. As is visible in this figure, the typical "trapping time" in a conformation with an 
energy AE < AE'hit differs by only a factor of 2 or 3 between the sequences C and W, compared 
to a factor of 30 difference in folding times. This observation further confirms that the slow folding 
of sequence W does not originate from the trapping in a few "valleys" in phase space with a very 
long trapping time. On the contrary, the trapping time of the average folder C and the very slow 
folder W is of the same order of magnitude and slow folding seems to originate primarily from the 
profusion of low-energy conformations for sequence W. 

5 Conclusion 

Let us now summarize our answers to the three questions presented in the introduction. 

• (i) We have investigated the relation between compact-state-spectrum predictions, namely the 
A/r criterion, Eqs. and (^), and thermodynamic stability, determined from Monte-Carlo 
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simulations where all the states (not only the compact ones) are taken into account. We find 
that for high A/T the predictions of the compact-state-spectrum analysis are in good agree- 
ment with the Monte-Carlo simulations. Namely, sequences that have a high A/T ratio also 
have a sharp transition to the native state conformations ( "thermodynamically stable"). For 
sequences with low and intermediate A/T ratio, the transition widths vary considerably and 
are generally broader. This suggests that low energy open configurations can be important in 
determining 5 in this case. We find that designability (defined as the number of sequences that 
have a given structure as their nondegenerate ground state) is not in one-to-one correspondence 
with thermodynamic stability. Some sequences with highly designable ground states are ther- 
modynamically unstable. However, we did find that sequences with highly designable ground 
states have on average a large A/T ratio, and that sequences with poorly designable ground 
states have on average a small A/T ratio. 

• (ii) The folding simulations have shown that sequences with a large A/T ratio fold fast. Se- 
quences with a low A/T ratio, and therefore sequences which are thermodynamically unstable, 
may fold slow or fast at the relatively high temperatures that we investigated. At these temper- 
atures the protein spends only 10% of the time near its ground-state configuration. We have 
argued, with several examples, that at lower temperatures the thermodynamically unstable 
sequences are likely to become slow folders whereas the stable ones are likely to continue to 
fold fast. 

• (iii) To be "protein-like" a sequence must first be thermodynamically stable, which follows if 
it has a large A/T ratio. Second, the sequence must be mutationally stable, which follows if it 
has a highly designable ground state. We find that the third requirement to be "protein-like" , 
namely fast folding, does not introduce additional constraints on sequence selection. Once a 
sequence has been designed to be thermodynamically stable (large A/T) and stable against 
mutations (large Ns) its folding dynamics is fast. 

We have also explored the reason for the slow folding of a sequence with an anomalously large 
average folding time. We found that the "traping time" in a given low energy state is of the same 
order of magnitude for both fast and slow sequences, and cannot explain the large differences in the 
average folding times. Instead, we have shown that the slow folder visits a large number of different 
low-energy states, whereas only a few low-energy states are visited by the fast folder. This suggests 
that slow folding dynamics originates from a large number of low energy conformations. 

Finally, we make a few remarks to illustrate the magnitude of structure and sequence selection. 
We started from a set of 2^^ sequences, only 4.75% (approximately 6, 000, 000) of these having a 



nondegenerate compact ground-state ||13|] structure. The total number of possible compact structures 



is 51, 704 1 13]. Among all these structures, only 60 are highly designable (estimated from the jump of 
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the average compact-state-spectrum gap as Ns increases above ~ 1400 ||T^). The 60 highly designable 
structures are designed by a total of 128,320 sequences, only 15% of which have A/T > 4.3 and are 
thus expected to be "protein-like" folders. In summary, only about 0.01% of the initial 2^^ sequences 
satisfy all the requirements for "protein-like" behavior. 
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